# Day 4: Complete Guide to Transformer Architecture
The Transformer, introduced in the 2017 paper "Attention Is All You Need," is the foundation of modern LLMs. It has largely replaced RNNs and LSTMs for sequence modeling because, unlike those recurrent models, it processes all tokens in parallel, which makes large-scale training practical.
## Overall Transformer Structure
```
Input Text -> [Token Embedding + Positional Encoding]
            |
+-------------------------+
|   Encoder Block (xN)    |
| +---------------------+ |
| |   Self-Attention    | |
| |   + Residual + LN   | |
| +---------------------+ |
| |    Feed-Forward     | |
| |   + Residual + LN   | |
| +---------------------+ |
+-------------------------+
            |
+-------------------------+
|   Decoder Block (xN)    |
| +---------------------+ |
| |  Masked Self-Attn   | |
| +---------------------+ |
| |   Cross-Attention   | |
| +---------------------+ |
| |    Feed-Forward     | |
| +---------------------+ |
+-------------------------+
            |
       Output Text
```
The GPT series uses only the decoder stack, while BERT uses only the encoder stack; the original Transformer uses both, for machine translation.
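Before the blocks above run, the diagram's first step adds order information to the token embeddings. The original paper uses fixed sinusoidal positional encodings; here is a minimal sketch of that scheme (the function name `positional_encoding` is my own, not from the lesson):

```python
import numpy as np

def positional_encoding(seq_length, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    positions = np.arange(seq_length)[:, np.newaxis]           # (seq, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dimensions
    return pe

pe = positional_encoding(seq_length=3, d_model=4)
print(pe.round(3))  # each row is a distinct signature for its position
```

This matrix is simply added to the token embeddings, giving each position a unique, deterministic signature without any learned parameters.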
## Simplified Self-Attention Implementation
```python
import numpy as np

def self_attention(query, key, value):
    """Scaled Dot-Product Attention"""
    d_k = query.shape[-1]
    # 1. Compute similarity via dot product of Query and Key
    scores = np.matmul(query, key.T) / np.sqrt(d_k)
    # 2. Normalize weights with Softmax (numerically stable form)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    # 3. Apply weights to Values
    output = np.matmul(attention_weights, value)
    return output, attention_weights

# 3 tokens, 4-dimensional vectors
seq_length, d_model = 3, 4
x = np.random.randn(seq_length, d_model)

output, weights = self_attention(x, x, x)
print(f"Attention weights:\n{weights.round(3)}")
print(f"Output shape: {output.shape}")  # (3, 4)
```
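The decoder's masked self-attention from the diagram is the same computation with one extra step: each token's scores for later positions are set to negative infinity before the softmax, so token i can only attend to tokens 0..i. A minimal sketch (the helper name `causal_self_attention` is mine):

```python
import numpy as np

def causal_self_attention(query, key, value):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    d_k = query.shape[-1]
    scores = np.matmul(query, key.T) / np.sqrt(d_k)
    # Mask future positions: the strict upper triangle becomes -inf,
    # which the softmax turns into exactly zero weight
    seq_length = scores.shape[0]
    mask = np.triu(np.ones((seq_length, seq_length), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, value), weights

x = np.random.randn(3, 4)
_, w = causal_self_attention(x, x, x)
print(w.round(3))  # the upper triangle is all zeros
```

This masking is what lets GPT-style decoders be trained to predict the next token without "peeking" at it.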
## Feed-Forward Network and Layer Normalization
```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer Normalization: normalizes each token vector"""
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise Feed-Forward Network"""
    # Expand dimensions then contract (typically 4x expansion)
    hidden = np.maximum(0, np.matmul(x, w1) + b1)  # ReLU activation
    output = np.matmul(hidden, w2) + b2
    return output

def transformer_block(x, w1, b1, w2, b2):
    """A single Transformer block"""
    # Self-Attention + Residual Connection + Layer Norm
    attn_output, _ = self_attention(x, x, x)
    x = layer_norm(x + attn_output)  # Residual connection
    # Feed-Forward + Residual Connection + Layer Norm
    ff_output = feed_forward(x, w1, b1, w2, b2)
    x = layer_norm(x + ff_output)  # Residual connection
    return x

# Initialization and execution
d_model, d_ff = 4, 16
w1 = np.random.randn(d_model, d_ff) * 0.1
b1 = np.zeros(d_ff)
w2 = np.random.randn(d_ff, d_model) * 0.1
b2 = np.zeros(d_model)

x = np.random.randn(3, d_model)
output = transformer_block(x, w1, b1, w2, b2)
print(f"Input shape: {x.shape}, Output shape: {output.shape}")
```
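One simplification in the `layer_norm` above: real LayerNorm layers also carry learnable per-feature scale (gamma) and shift (beta) parameters, so the network can rescale or even partially undo the normalization during training. A sketch of the full form (the function name `layer_norm_affine` and the conventional gamma/beta naming are my additions, not code from the lesson):

```python
import numpy as np

def layer_norm_affine(x, gamma, beta, eps=1e-6):
    """Layer normalization with learnable scale (gamma) and shift (beta)."""
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

d_model = 4
gamma = np.ones(d_model)   # typical init: starts as plain normalization
beta = np.zeros(d_model)
x = np.random.randn(3, d_model)
print(layer_norm_affine(x, gamma, beta).round(3))
```

With gamma initialized to ones and beta to zeros, this behaves identically to the plain version; the parameters are then updated by gradient descent along with the rest of the model.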
## Summary of Key Components
| Component | Role | Key Idea |
|---|---|---|
| Self-Attention | Captures relationships between tokens | Every token attends to every other token |
| Feed-Forward | Non-linear transformation | Expand then contract dimensions |
| Layer Normalization | Stabilizes training | Normalizes the output of each layer |
| Residual Connection | Ensures gradient flow | Adds input to output |
| Positional Encoding | Provides token order information | Without it, word order is ignored |
The Transformer is a combination of these five components. Tomorrow we’ll dive deeper into the most critical component: the Attention mechanism.
## Today’s Exercises
- Explain why deep networks are difficult to train without Residual Connections. Relate your answer to the vanishing gradient problem.
- Stack the `transformer_block` function 6 times to build a 6-layer Transformer. Verify that the input/output shapes are preserved.
- Summarize which tasks encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures are each best suited for.